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We have developed molecular dynamics codes for a short-range interaction potential that adopt 
both the flat-MPI and MPI/OpenMP hybrid parallelizations on the basis of a full domain decom- 
position strategy. Benchmark simulations involving up to 38.4 billion Lennard- Jones particles were 
performed on PRIMEHPC FXIO, consisting of 4800 SPARC64 IXfx 1.848 GHz processors, at the 
I Information Technology Center of the University of Tokyo, and a performance of 193 teraflops was 

O, achieved, which corresponds to a 17.0% execution efficiency. Cavitation processes were also simu- 

lated on PRIMEHPC FXIO and SGI Ahix ICE 8400EX at the Institute of Solid State Physics of the 
^SJ ' University of Tokyo, which involved 1.45 billion and 22.9 million particles, respectively. Ostwald-like 

ripening was observed after the multibubble nuclei. Our results demonstrate that direct simulations 
of multiscale phenomena involving phase transitions from the atomic scale are possible and that the 
_ . molecular dynamics method is a promising method that can be applied to petascale computers. 
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I. INTRODUCTION 

b 

• ] Most recent computers consist of multiple nodes, with each node usually having multiple processors. Additionally, 
^ ■ not only state-of-the-art computer but commodity cluster systems adopt hierarchical memory models, i.e., shared 
• ^ I memory within a node and distributed memory across the nodes. Corresponding to the different memory models, 
sj^. there are two programming models, the Message Passing Interface (MPI) library and OpenMP, which are the de 

■ facto standards for threading distributed and shared memory models, respectively. To achieve better performance on 
^ f systems with such a heterogeneous memory model, programmers have to choose their strategies for parallelization 

from the following three possible approaches; fully distributed memory models, global address space models, and a 
^ , combination of distributed and shared memory models. The fully distributed memory models are achieved by the 
^ ■ so-called flat-MPI strategy, i.e., shared memory in a node is treated as distributed memory and only the MPI library is 
] used for both intra- and internode communication. This strategy reduces development costs, since a programmer can 
. ignore the heterogeneity of the memory model in the system. Also, the flat-MPI approach usually achieves acceptable 

■ performance since the MPI library is usually optimized so that it can utilize a shared memory model within a 
, node. The global address space models allow programmers to use a virtually global memory address space, which 

■ is distributed over nodes. Using partitioned global address space (PGAS) languages, programmers can control the 
^-H , affinity of the memory space to threads. There are many languages that adopt this programming model, for example, 
CN ■ Unified Parallel C j^], Co- Array Fortran 0], Titanium , XcalableMP [3| , and so forth. While these languages reduce 
\ [ \ the development cost of software, users have to utilize their features to the full to obtain sufficient performance for 

^ ■ large-scale computations. Finally, the combination of distributed and shared memory models is usually achieved by 
using the MPI library for intranode communication and OpenMP for intranode communication. Although it seems 
^ . , natural to adopt the distributed memory model across nodes and the shared memory model within each node, the 

■ flat-MPI approach usually gives better performance than the hybrid approach. However, one cannot always adopt 
the flat-MPI approach since it requires more resources such as the total memory, especially for massively parallel 
simulations. Many comparisons between the flat-MPI approach and the OpenMP/MPI hybrid approach have been 
carried out. However, OpenMP/MPI hybrid programming is usually implemented by domain decomposition with 
the MPI library and loop-level decomposition with OpenMP. Even if the total systems of two schemes are identical, 
the computations applied to each core are different. This makes difficult to compare the efficiency between the two 
schemes, since it is difficult to find reasons for the difference in performance. 

In this paper, we describe a pseudo-flat-MPI implementation of molecular dynamics (MD) simulations that allows 
us to compare the performance between flat-MPI and a hybrid using the same program. We adopt the domain 
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decomposition strategy for both intra- and internode parallelizations. We also focus on how to perform product runs 
using the hybrid parahehzed MD, since the development cost of implementing codes for product runs ca be compared 
with that of implementing benchmark codes. Programmers have to consider many factors, such as the parallelization 
of observation codes, file lOs, and memory consumption. The scaling analysis of benchmark simulations involving up 
to 38.4 biUion Lennard- Jones (LJ) particles is performed on PRIMEHPC FXIO, consisting of 4800 SPARC64 IXfx 
1.848 GHz processors, at the Information Technology Center of the University of Tokyo. To achieve better performance 
on FXIO, we optimize the simulation kernel and achieve a performance of 193 teraflops, which corresponds to a 17.0% 
execution efficiency. We also perform substantial simulations of cavitation phenomena involving up to 1.45 billion 
particles on two different platforms, FXIO and SGI Altix ICE 8400 EX (Intel Xeon 2.93 GHz processors), at the 
Institute of Solid State Physics of the University of Tokyo. 

This paper is organized as follows. In Sec. [Ill details of our method are described including the parallelization. 
Benchmark results are given in Sec. IIIII and cavitation phenomena are studied in Sec. IIVI Finally, Sec. |V]is devoted 
to a summary and discussions of this study. 



II. METHOD 



A. Basic Algorithms 



We perform classical MD simulations of LJ potentials with truncation, 
for short-range MD. Refer to a previous paper for detailed descriptions Q 
particles is given by @ 



Here, we briefly describe the algorithms 
The two-body potential of truncated LJ 
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with particle distance r, well depth e, and atomic diameter a. The two additional coefficients, C2 and cq, are determined 
so that the potential and force become zero at the cutoff length Vc, i.e., V{rc) — V'{rc) = 0. In the following, we 
use physical quantities reduced by cr, e, and the Boltzmann constant /cb, i.e., the length scale is measured using 
the unit of cr, and so forth. We set the cutoff length to — 2.5 for benchmark simulations and to = 3.0 for 
simulations of cavitation. Since interactions between particles are short-range, it is necessary to find particle pairs 
separated by a distance of them is less than the cutoff length. We adopt a grid algorithm [7l-[l0| (also called the 
linked- list method [HI, [l^]) to find the interacting particle pairs. Since the computational cost of constructing a list 
of interacting particle pairs is high, we adopt the bookkeeping method to reduce the cost of constructing the list [l^ , 
which allows us to reuse the same pair list for several time steps by registering pairs within some length which is longer 
than the cutoff length. Several steps after the construction of the pair list, there may be particle pairs separated by a 
distance less than the cutoff length that are not registered in the list. If such pairs exist, the simulation fails since the 
total energy will not be conserved. Therefore, we have to check the validity of the list in every time step. For this, 
we adopt the dynamic upper time cutoff (DUTC) method 1^. The validity of the list is checked using the maximum 
velocity and the displacements of the particles [5] . This validity check involves the global synchronization between all 
processes. 

After constructing a pair list, we sort the list to reduce the number of accesses to the memory (see Fig.[T|). Consider 
a particle pair {i,j), where i < j. We call the particle of index i the key particle and the other the partner particle. 
We sort the particle pairs by the indices of the key particles. Then the data of the key particle is stored in CPU 
registers, and the amount of memory access is decreased since data fetching and storing are performed only for the 
data of partner particles. 
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FIG. 1: (a) Sorted pair list. The pair list is sorted by the indices of key particles, (b) Data format of the sorted pair 
list. A sorted pair list is expressed by two arrays; KeyPointer and SortedList. The first partner of key particle i is stored in 
SortedList[KeyPointer[i]], the second is stored in SortedList [KeyPointer [i] -1-1], and so forth. 
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B. Parallelization 
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FIG. 2: Schematic illustration of the pseudo fiat-MPI approach. A two-dimensional system is shown for convenience. There 
are two MPI processes, each containing four OpenMP threads, (a) MDUnit and MDManager represent an MPI process and 
an OpenMP thread, respectively. Each domain is assigned to an MDUnit, and an MDManager takes responsibility for their 
communication, (b) Communication between processes. Since a platform may not support MPI calls from a multithread 
environment, data should be packed before being sent by a sender process, and the data should be distributed to destinations 
by a receiver process. 

Although there are several strategies for the parallelization of MD [l5| , the domain decomposition method is simple 
and generally scalable for a huge number of processes. It is also the most efficient for particle systems with short- 
range interactions if the interaction length is sufficiently shorter than the size of the domain assigned to each process. 
Domain decomposition parallelization is usually implemented by the flat-MPI approach. This approach allows a 
programmer to consider only a single programing model, and therefore, the development cost is reasonable for huge 
scale computations. However, limited resources, usually a lack of memory, may prevent the adoption of the flat-MPI 
programming model. Thus, the hybrid programming model is required to reduce the number of MPI processes. One 
method of implementing hybrid parallelized MD is to adopt domain decomposition for internode communication 
with the MPI and particle decomposition for intranode communication with loop-level OpenMP. One advantage of 
this method is that the implementation can be simple. By inserting an OpenMP directive before the loop of force 
calculations, one can modify a flat-MPI code to an OpenMP/MPI hybrid code. However, there are some disadvantages 
of this method. Construction of the pair list typically costs 10% of the total computation time. If the parallelization 
efficiency of the force computation within a node is perfect, then the construction of the pair list may become a new 
bottleneck in simulations. Therefore, one has to implement OpenMP parallelization of the pair list construction, as 
well as the force calculation. Additionally, the observation can also be a bottleneck. Therefore, a programmer has to 
consider all routines and implement hybrid versions of them if necessary. This pushes up the development cost. 

In this study, we adopt a pseudo-flat-MPI approach, i.e., we adopt a full domain decomposition strategy for both 
intra- and internode parallelizations by using thread IDs as ranks of the MPI fl^ . The advantages of the pseudo-flat- 
MPI approach as follows, with details will be described later. 

Simple implementation: If one already has a flat MPI code, it is straightforward to modify it to a pseudo-flat-MPI 
code. 

Flexibility: It is easy to change the number of threads per process without rebuilding the program. 

Affinity v^rith CPU-level tuning: For the naive loop-level OpenMP paradigm, one has to consider OpenMP par- 
allelization and SIMD (single instruction multiple data) optimization simultaneously. The pseudo-flat-MPI 
paradigm allows a programmer to consider only SIMD optimization. 

Memory locality: For systems with Non-Uniform Memory Access (NUMA) architecture, a programmer has to 
consider memory locality, i.e., which data should be assigned to which CPU core. In the pseudo-flat-MPI 
paradigm, the memory locality is automatically satisfied as for the flat-MPI paradigm. 

In the following, we describe our implementations in CH — h language since we have written our codes in CH — h. 
However, similar approaches are possible in C or Fortran 77 or other non object-oriented languages. We define two 
classes; MDManager and MDUnit. An MDUnit class represents an OpenMP thread (see Fig. [2] (a)). The system is 
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divided into small domains, and one domain is assigned to one instance of the MDUnit class, i.e., the instance of the 
MDUnit class is the owner of the domain and is responsible for updating the momenta and coordinates of particles 
in the domain. An MDManager class represents an MPI process. It manages at least one instance of MDUnit and is 
responsible for communication between domains. Sample codes for the initialization of MDUnit are as follows. 

MDUnit *mdp; 

std: : vector <MDUnit *> mdv; 

#pragma omp parallel shared (mdv) private (tid,mdp) 
{ 

tid = omp_get_thread_num() ; 
mdp = new MDUnit (tid); 
#pragma omp critical 
mdv . push_back (mdp) ; 

> 

Since variables are initialized in each constructor of MDUnit, the memory locality is automatically satisfied. The number 
of threads per process corresponds to the number of instances of MDUnit managed by an instance of MDManager. If 
one MDManager has only one instance of MDUnit, then the computation is performed in the manner of the flat-MPI 
paradigm; otherwise it is performed in the manner of the hybrid. The number of threads per process is easily changed 
by changing the value of the DMP_NUM_THREADS environment variable. A sample code for calculating force is as follows. 

#pragma omp parallel for schedule (static) 
for(int i = 0; i < num_threads; i++) { 
mdv [i] ->CalculateForce ; 

} 

Here, num_threads is the number of threads per process. The implementation of the MDUnit class does not depend 
on whether it is called by a thread or a process except is the case of communication. Therefore, it is easy to modify 
an existing flat-MPI code to a pseudo-flat-MPI code. 

Communication involving hybrid parallelization should be considered since there are several levels of support for a 
call of MPI routines in a multithread environment. The specifications of MPI 2.2 define the following four levels of 
support [l7| . 

• MPI_THREAD_SINGLE: MPI calls in the OpenMP parallel region are not supported. 

• MPI_THREAD_FUNNELED : The process may be multithread, but the application must guarantee that only the 
main thread makes MPI calls. 

• MPI_THREAD_SERIALIZED: The process may be multithread and any thread may make MPI calls. However, the 
application must guarantee only one call at a time, i.e., consecutive MPI calls from different threads should be 
serialized. 

• MPI_THREAD_MULTIPLE: Any thread may make MPI calls without any restrictions. 

If the system supports MPI_THREAD_MULTIPLE, the implementation of communication in the pseudo-flat-MPI becomes 
identical to that in the flat-MPI. Each instance of MDUnit can communicate with others as if it is a process while 
it is a thread. However, most platforms do not support MPI_THREAD_MULTIPLE. For example, the support level of 
both sites we use in this study is MPI_THREAD_SERIALIZED. Therefore, a programmer has to explicitly manage intra- 
and internode data exchanges. The data required from other processes should be packed before being sent by a 
sender process and they are distributed by a receiver process (see Fig. [21 (b)). Note that one does not have to pack 
the data for communication if the system supports MPI_THREAD_SERIALIZED. However, making MPI calls outside 
OpenMP parallel regions is safe and portable, since it will work on platforms that support only MPI_THREAD_SINGLE. 
This two-level communication is also required for observation routines. Therefore, one has to rewrite all codes for 
observations even if one has already developed parallelized observation codes for a flat-MPI program. This is one of 
the disadvantages of the pseudo-flat-MPI approach. 

We do not consider any load balancing. The dimensions of each domain assigned to a thread are identical among 
domains and do not change throughout the simulations. Although load imbalance may decrease the efficiency of 
simulations, the decrease is not serious for our simulations. The reason for this is described in Sec. IIVDI 
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Facility System CPU Cores Memory Nodes 

ITC PRIMEHPC FXIO SPARC64 IXfx (1.848 GHz) 16 32 GB 4800 
ISSP SGI Altix ICE 8400EX Intel Xeon X5570 (2.93 GHz) 8 24 GB 1920 

TABLE I: Summary of (ITC) PRIMEHPC FXIO at the Information Technology Center of the University of Tokyo and (ISSP) 
SGI Altix ICE 8400EX at the Institute for Solid State Physics of the University of Tokyo. The facility, name, CPU, number of 
cores per node, memory per node, and number of nodes in the system are shown for each system. Although the total number 
of nodes at ISSP is 1920, we only use 128 nodes in this study. 



III. BENCHMARK RESULTS 
A. Details of Platforms 

We perform simulations on two platforms. One is PRIMEHPC FXIO at the Information Technology Center (ITC), 
the University of Tokyo. Each node contains a SPARC 64 IXfx 1.848 GHz processor containing 16 CPU cores. 
The memory access is uniform from each core, i.e., this system adopts Uniform Memory Access (UMA). Each node 
contains 32 GB and the total amount of memory is 150 TB. We used Fujitsu C/C++ Compiler Ver. 1.2.1 with the 
compile options -Kfast -Ksimd=2 -Knoparallel -Kopenmp. Note that the compile options supported by the Fujitsu 
C/C++ compiler can be different for different sites. 

The other is SGI Ahix ICE 8400EX at the Institute for Solid State Physics (ISSP), of the University of Tokyo. 
Each node contains two Intel Xeon 2.93 GHz processors containing 4 CPU cores, and each node adopts cache- 
coherent NUMA (ccNUMA). The total system contains 1920 nodes and provides 180 teraflops. Each node contains 
24 GB and the total amount of memory is 46 TB. We used Intel C++ Compiler Ver. 11.1 with the compile options 
-03 -ip -ipo -XSSE4.2 -axSSE4.2 -openmp. A summary of the platforms is shown in Table HI 

Although we perform product runs of cavitation phenomena on both FXIO and SGI Altix ICE 8400EX, we perform 
benchmark simulations only on FXIO at ITC, since the benchmark results on ISSP have already been reported [5||. 

B. Comparison between Flat MPI and Hybrid 

The conditions for our benchmark simulations are as follows fs']. All quantities are measured in LJ parameters, 
such as the particle diameter cr, and the energy scale e. 

• The dimensions of each domain are 100 x 100 x 100 and one domain is assigned to one MPI process (flat MPI) 
or one OpenMP thread (hybrid). 

• The number density is 0.5, i.e., each domain contains 500,000 particles. 

• Initial condition: Face-centered-cubic lattice. 

• Boundary condition: Periodic for all axes. 

• Integration scheme: Second-order symplectic integration. 

• Time step: 0.001. 

• Cutoff length: 2.5. 

• Cutoff scheme: Add constant and quadratic terms to potential as shown in Eq. ([l}. 

• Initial velocity: The absolute value of the velocity of all particles is set to 0.9 and the directions of the velocities 
are given randomly. 

• After 150 steps, we measure the calculation (elapsed) time for the next 1000 steps. 

Computational speeds are measured in the unit MUPS, which is millions of updates per second. When a system with 
N particles is simulated for k steps in t seconds, then the number of MUPS is given by 10~^ Nk/t. The number of 
nodes is increased while keeping the same number of particles on each node (weak scaling). We apply two paradigms, 
flat-MPI and the OpenMP/MPI hybrid for each size of simulation. In the flat-MPI case, we assign 16 processes on 
a node (one MPI process to one CPU core). In the hybrid case, we assign one MPI process containing 16 OpenMP 
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threads on a node (one OpenMP thread to one CPU core). We use identical executable file for the two models and 
we assign domains to nodes so that the same part of the system is assigned to the same CPU core for the flat- 
MPI and hybrid. Therefore, the simulations are completely identical for the flat-MPI and hybrid runs including the 
communication patterns across nodes. 

The results of the flat-MPI are summarized in Table and those of the hybrid are summarized in Table IIIII The 
elapsed times of the runs are shown in Fig. [3l Since we perform weak scaling analysis, the elapsed time will be 
independent of the number of nodes for perfect parallel efficiency. As shown in Fig. [31 while the parallel efficiency 
of flat-MPI is almost perfect, that of the hybrid runs fluctuates. The performance of flat-MPI is always better 
than that of the hybrid. Since the computations are identical for the two methods, the difference originates from 
the parallel overheads of OpenMP. There are several possible causes of the overheads. One of them is the cost of 
thread management. The creation and destruction of threads are required for OpenMP. Additionally, the number of 
synchronizations per step is larger than in flat-MPI. Although only one global synchronization per step is required for 
a flat-MPI, several intranode synchronizations are required for the hybrid. Even for a perfect load balance, frequent 
synchronization may cause parallel overhead due to noise from the system. Note that the total memory required 
for the hybrid is less than that for flat-MPI, as expected. For the largest run involving 4800 nodes, the hybrid run 
consumed 10 GB per node out of 32 GB while flat-MPI consumed 14 GB. 



Nodes Number of Particles Elapsed Time [s] Speed [MUPS] Efficiency 



1 


8,000,000 


445.815 


17.9447 


1.0 


72 


576,000,000 


448.275 


1284.93 


0.99 


480 


3,840,000,000 


452.781 


8480.93 


0.98 


1440 


11,520,000,000 


456.641 


25227.7 


0.98 


4800 


38,400,000,000 


458.399 


83769.9 


0.97 



TABLE II: Results of benchmark simulations of flat-MPI. The number of nodes, number of particles, elapsed time [s], compu- 
tational speed [MUPS], and parallel efficiency are listed. The parallel efficiency is determined relative to the elapsed time for 
the single-node calculation. 



Nodes Number of Particles Elapsed Time [s] Speed [MUPS] Efficiency 



1 


8,000,000 


585.160 


13.6715 


1.00 


72 


576000000 


627.449 


918.003 


0.93 


480 


3840000000 


724.139 


5209.34 


0.79 


1440 


11520000000 


724.139 


15908.5 


0.81 


4800 


38,400,000,000 


684.871 


56068.9 


0.85 



TABLE III: Same as Table [IT] for hybrid runs. 



C. Optimization for FXIO 

The results shown in the previous section are obtained using a code optimized for Intel Xeon architectures. The 
performance of the largest job is 56.5 teraflops, which corresponds to an approximately 5% execution efficiency. To 
improve the efficiency and performance of simulations, we optimize the force calculation routine for FXIO. The 
algorithm optimized for FXIO is given as Algorithm [U where N is the total number of particles, q[i] is the position 
of particle i, p[i] is the momentum, and dt is the time step. The variables in bold denote three-dimensional vectors, 
e.g., q = {qx,qy,qz). The constants C2 and cq are coefficients introduced for the truncation in Eq. ([1}. We find 
that store operations involving indirect addressing is expensive for FXIO. Therefore, we avoid random storing by not 
using Newton's third law. While the number of iterations required to calculate forces (the length of SortedList) is 
doubled without Newton's third law, the computational efficiency increases more than twofold. We also utilize SIMD 
operations. The variables with hats in Algorithm [T] denote the packed variables. A packed variable is stored in 128-bit 
registers and contains two double-precision 64-bit numbers. Suppose there are two double-precision numbers a and 
b and a packed variable c. Packing a and b into c is denoted as c {a, 6} and unpacking c to a and b is denoted 
as {a, b} -It- c. To hide the latency by software pipelining, the inner loop is unrolled four times. Since the two-body 
potential is truncated at the cutoff length rc, the force calculation involves checking whether or not the distance 
between a particle pair is shorter than the cutoff length. To perform the check with SIMD operations, a new variable, 
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FIG. 3: Results of benchmark simulations for flat-MPI and hybrid runs (weak scaling). Elapsed times for 1000 steps are shown. 
Although the elapsed times of flat-MPI are almost independent of the number of nodes, the results of hybrid runs fluctuate. 
Additionally, computational speed of the flat-MPI is always higher than that of the hybrid. The data of flat-MPI using the 
optimized code for FXIO is also shown. The optimized code is about 25% faster than the original code. 



w = 1 — [r^/r^J, is introduced, where \_x\ denotes the largest integer not larger than x. The variable w can be used 
as a mask for the truncation of the potential, since w = 1 when the distance between the particle pair is shorter than 
the cutoff length and w — otherwise. Note that each distance between a particle pair in a pair list should be less 
than ^/2rc to use this mask. We construct a pair list so that this condition is always satisfied. 

The force calculation of the LJ interaction involves at least one division for each particle pair. However, the 
floating-point divide instruction (fdivd) in FXIO cannot be specified as a SIMD instruction. Therefore, we use 
the floating-point reciprocal approximation instruction (frcpad) instead of the divide instruction. The reciprocal 
approximation instruction is faster than the divide instruction at the expense of precision. The rounding error due 
to the approximation is up to 1/256, i.e., 



frcpad(a;) ~ l/x 



l/x 



which is insufficient for MD simulations. Therefore, we improve the precision by the following error correction. 
Suppose r is the particle distance. The force calculation of 6-12 LJ interactions requires the values of and r^^*^. 
Consider the calculation of from r^. Let be a reciprocal approximation of r^, i.e., = frcpad(r^). The 
rounding error due to the approximation can be expressed as 

p-^r^ = l + e, (|e| < 1/256) (3) 

with error e. Then the required value is expressed as 

(4) 

(5) 
(6) 
(7) 

Therefore, the error of the corrected value (f^^)''(5 — 4f^^r^) relative to r^^ is 0{e'^). Similarly, the correction 
coefficient for r^^'* is obtained to be (8 — 7f^^r^). The effect of the error correction is shown in Fig. |31 One can see 
that the conservation of the total energy is improved by the error correction. 

The performance of the optimized code is summarized in Table HVl and shown in Fig. [3] as the data denoted by Flat 
MPI (Optimized). The increase in speed due to optimization is about 25% compared with the original code with flat 
MPI. Although the performance also increases to 193 teraflops, which corresponds to a 17.0% execution efflciency, 
one has to note that it contains redundant computations owing to the nonuse of Newton's third law. 
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Algorithm 1 Code for calculating force optimized for FXIO. 

1: fc ^ {rc,rc} 

2: Ci ^ {ci, ci} 

3: tt^ {1,1} 

4: <- {dt, dt} 

5: for i = 1 to iV - 1 do 

6: q, ^ {q[i],q[i]} 

7: I [(KeyPointer[j + 1] - KeyPointer[i])/4j 

8: ko KeyPointer[i] 

9: ^{0,0} 

10: for fc = to Z - 1 do 

11: fc ^ fco + 4fci 

12: ja -i— SortedList[fc] 

13: jb ^ SortedList[fc + 1] 

14: jc ^ SortedList[fc + 2] 

15: jd ^ SortedList[fc + 3] 

16: ciA ^ {q[>].qb6l} 

17: cIb ^ {q[jc],q[id]} 

18: f A qA - Qi 

19: fi3 ^ q^ - q, 

20: ri^lqip 

21: f| ^ Iq^l^ 

22: WA^u - \r\lrl\ 

23: i&is ^ M - Lr|/f?J 

24: /a ^ [(24fi - 48)/f3l' + 8ci]dW 

25: /i3 ^ [(24f| - 48)/fl;'' + 8ci]dttl)s 

26: Pi Pi + /ArA + /srs 

27: end for 

28: {Pi,P2}4-p, 

29: p[i] ^ p[i] + Pi + Pa 

30: for k = KeyPointer[i] +4/ to KeyPointer[i + do 

31: j ^ SortedList[fc] 

32: r ^ q[j] - q[i] 

33: ^ |r|^ 

34: if < rl then 

35: / ^ [(24r« - 48) /r" + 8c2]dt 

36: p[j] ^ p[i] + /r 

37: end if 

38: end for 

39: end for 



Nodes Number of Particles Elapsed Time [s] Speed [MUPS] Efficiency 



1 


8,000,000 


351.821 


22.7389 


1.00 


256 


2,048,000,000 


355.829 


5755.57 


0.99 


512 


4,096,000,000 


362.775 


11290.7 


0.97 


1024 


8,192,000,000 


368.275 


22244.2 


0.96 


2048 


16,384,000,000 


364.272 


44977.4 


0.97 


4096 


32,768,000,000 


373.007 


87848.3 


0.94 


4800 


38,400,000,000 


364.920 


105229.0 


0.96 



TABLE IV: Same as Tables [TI] and [TTT] for the flat-MPI code optimized for FXIO. 

IV. MULTIBUBBLE NUCLEI 

The study of multibubble nuclei is challenging, since it is a typical multiscale and multiphysics phenomenon that 
involves phase transitions at the microscopic scale and interbubble interactions at the meso- or macroscopic scale. 
From the viewpoint of computational simulations, phase transitions make simulations difficult since they involve the 
creation, movement, and annihilation of phase interfaces. If one simulates bubble nuclei using MD, then there is 
no need to consider phase transitions and phase interfaces. Both of these occur spontaneously in the simulation 
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FIG. 4: Effect of error correction on reciprocal approximation. The number of particles is 4000. Computations are performed 
on a single core. The open circles denote the results without the error correction and the filled circles denote those with the 
error correction. The energy conservation with the error correction is sufficient. 
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FIG. 5: Phase diagram of the system. The solid lines denote the gas-liquid coexistence densities. The open circle denotes the 
initial condition of the system. After the expansion, the state of the system moves to the point denoted by the solid circle. 



as a result of the dynamics of particles. However, if one wants to reproduce multibubble nuclei by MD, then a 
huge number of particles are required; therefore, a large-scale simulation is necessary. There are several reports on 
extremely large-scale MD simulations. A team from Lawrence Livermore National Laboratory and IBM studied the 
Kelvin-Helmholtz instability by MD involving 62.5 billion particles on BlueGene/L, for which they won the 2007 
Gordon Bell Prize [l^. In 2008, Germann and Kadau demonstrated one-trillion-atom MD on BlueGene/L using 
single precision [l^. However, to our best knowledge, there have been studies of large MD simulations involving a 
phase transition carried out on systems containing over 10,000 CPU cores. 

In this study, we perform large-scale MD simulations of multibubble nuclei. There are two types of processes 
involving bubble nuclei; boiling and cavitation. Boiling occurs when a liquid is heated under a constant-pressure 
condition and cavitation occurs when the pressure of the liquid is reduced. We study cavitation phenomena since a 
simulation of cavitation requires only a thermostat, while that of boiling requires both a thermostat and a valostat. 
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Site Scheme Processes Threads System Size Particles Total Steps Elapsed Time [h] 

ISSP Flat-MPI 1,024 1,024 320 x 320 x 320 22,937,600 690,000 10 

ITC Hybrid 4,800 76,800 1,440 x 1,200 x 1,200 1,449,776,020 178,000 15 

TABLE V: Details of conditions. The parallelization scheme, the numbers of processes and threads used for the jobs, the 
dimensions of the system, the total number of particles, the total number of steps, and the elapsed time are shown for each 
site. 

A. Cavitation Process 

We first maintain the system at a temperature T = 0.9 with a density p = 0.7 using a thermostat. This corresponds 
to the pure-liquid phase (see Fig. [S] for the phase diagram.). We use the Nose-Hoover thermostat to control tem- 
perature [l^]. Then we expand the system. The expansion is performed by changing the length scale of the system 
from I to al. The rescaling factor a is chosen to be larger than unity. With this change, the position of each particle 
also changes from {qx,qy,qz) to {aqx,oiqy^aqz) while keeping the momenta. This procedure models a uniform and 
adiabatic expansion. We chose the rescaling factor to be 1.025. Then the density changes from 0.7 to 0.65, which is in 
the gas-liquid coexistence region in the phase diagram; therefore, bubbles start to appear. After expansion, we turn 
off the thermostat so that it does not affect the physical processes of the bubble nuclei. The integration scheme used 
for the isothermal time evolution is the second-order reversible system propagator algorithm (RESPA) [2l|, and the 
second order symplectic method is used for the microcanonical simulation with a time step At = 0.005. We perform 
a small job at ISSP and a large job at ITC. Hereinafter, we denote the smaller and larger jobs by ISSP and ITC, 
respectively. The conditions of the jobs are summarized in Table fVl 

B. Bubble Identification 

To identify bubbles, we divide the system into small subcells with a length of 3.0 and determine the local density 
for each subcell [22i|. The densities of the gas and liquid coexisting in this system at T = 0.9 are estimated to be 
0.0402(2) and 0.6730(2), respectively [2^. Therefore, we define a subcell to be in the gas state when its density is less 
than 0.2. We have confirmed that the results do not change for other threshold values such as 0.3. We consider that 
neighboring gas state cells are in the same cluster and identify the bubbles using the site-percolation criterion on the 
simple cubic lattice. This process is called clustering, and the parallelization of clustering is quite expensive since it 
involves global communication similar to that of the fast fourier transform. Therefore, we gather all the data of local 
densities to the root node and perform clustering with serial calculation. To reduce memory consumption, we count 
the number of particles in each subcell and store it as an integer (data type unsigned char in C language) instead 
of storing the local density with floating-point variables. Since the volume of each subcell is 27 and the local density 
hardly exceeds 1.0, the number of particles in a subcell is at most 27, which can be stored in one byte without losing 
any information. A snapshot of the data is 1.4 MB at ISSP and 76.8 MB at ITC. To perform clustering, more work 
memory is required in addition to the data of the local densities. 

C. Simulation Results 

We perform the job with flat-MPI at ISSP, whereas we perform the job with the hybrid at ITC in order not to use 
too much memory. The time evolution of pressure in each case is shown in Fig. [6l Although the pressure is positive 
when the system is in the pure liquid phase, it suddenly becomes negative after the expansion and increases as bubbles 
appear and grow. Snapshots obtained during the runs are shown in Figs. [7] and H) In the job at ISSP, Ostwald-like 
ripening is observed after multibubble nuclei, and finally only one bubble remains in the system. Multibubble nuclei 
and Ostwald-like ripening are also observed at ITC. 

The time evolutions of the total number of bubbles are shown in Fig.|n](a). After the rapid increase in the number of 
bubbles, the number decreases slowly owing to interactions between bubbles, i.e., small bubbles are eliminated owing 
to the growth of larger bubbles. The power-law decay of the number of bubbles can be observed. The distributions of 
the bubble sizes at the 10,000th step after expansion are shown in Fig.[9](b). Although the results of both simulations 
show similar distributions, one can see that the distribution for the jobs at ISSP appears to be insufficient for further 
analysis. 
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FIG. 6: Time evolutions of pressure. First the system is maintained in the pure hquid phase. At f = 50, the system expands 
and the pressure becomes negative. As bubbles appear and grow, the pressure increases after the expansion. The behaviors of 
the two jobs are almost identical. 
























FIG. 7: (Color online) Snapshots of bubbles obtained during runs at ISSP. The direction of time evolution is from left to right. 
In the early stages, many bubbles appear and grow independently. Afterward, the bubbles start to interact and Ostwald-like 
ripening, where larger bubbles become larger and smaller bubbles disappear, is observed. Finally, one large bubble remains in 
the system. These pictures were produced by ParaView [Sj • 

D. Load Imbalance 

In the final part of this section, we discuss the load imbalance of our simulations. To check the validity of the pair 
list, we synchronize all threads at every time step. Therefore, the slowest thread determines the overall performance 
since the other threads must wait for the slowest thread due to the synchronization. The computational workload of 
a domain is roughly proportional to the number of particles it contains. Since the volumes of domains are identical, 
the workload is proportional to the number density. Since a liquid is almost incompressible, the largest density 
is determined by the coexisting density of the liquid at the temperature of the system. In our simulation, the 
temperature becomes 0.88 after the expansion. The densities of the coexisting gas pg and liquid p\ at this temperature 




FIG. 8: (Color online) Snapshots of bubbles obtained during runs at ITC. The direction of time evolution is from left to right. 
Multibublsle nuclei and the Ostwald-like ripening are observed. These pictures were produced by POV-Ray [2^ . 
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FIG. 9: (a) Time evolution of the total number of bubbles. The decimal logarithms are taken for both axes. The dashed line 
with a slope of -1.7 is a guide to the eyes, (b) Size distributions of bubbles at the 10,000th step after expansion. The decimal 
logarithms are taken for both axes. 




FIG. 10: Schematic illustrations of load imbalance, (a) Typical configuration of simulation of cavitation. There are many 
bubbles. The domain denoted by A is filled with pure gas, and the domain denoted by B is filled with pure liquid, i.e., it does 
not contain any bubbles, (b) Workloads of threads are proportional to the densities of the assigned domains. The dashed line 
denotes the average density of the system, which is the workload of domains for a perfect load balance. The workload of thread 
B is about 20 times larger than that of thread A. However, it is only 5% larger than the average workload. 



are pg = 0.0344 and p\ — 0.687, respectively [2^. Therefore, the computational time required for one step of the 
slowest process is proportional to pi, which is 20 times larger than that of the fastest thread. If a perfect load balance 
is achieved, all threads have identical numbers of particles, and the densities in the domains become the same as 
the total number density pavc- In our simulation, the average density after expansion is pavo = 0.65. Therefore, the 
computational time of the simulation without load balancing is proportional to pi and that with a perfect load balance 
is proportional to pave = 0.65. The reduction in speed due to the load imbalance is given by the ratio of the average 
workload to the heaviest workload, which is 1 — Pave/pi ~ 0.05, i.e., only 5% (see Fig. [TU]). This means that the 
load imbalance is minor problem in studies of cavitation phenomena. If one wants to study condensation phenomena, 
which involve droplet nuclei, then the problem of load imbalance becomes serious. Since the average density is close 
to that of the gas while the slowest thread is determined by the density of the liquid, the simulation performance can 
be 20 times worse than that of the perfectly load-balanced simulation. 



V. SUMMARY AND DISCUSSION 



We have developed a pseudo-flat MPI MD code and performed benchmark and product runs utilizing up to 4800 
nodes, and 76,800 cores. Benchmark simulations contained up to 38.4 billion particles and exhibited almost perfect 
scaling for flat-MPI. We studied cavitation phenomena with MD involving up to 1.45 billion particles and observed 
multibubble nuclei and Ostwald-like ripening. Bubbles appear as a result of the many-body interaction of particles. 
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and Ostwald-like ripening appears as a result of the many-body interaction of bubbles. This shows that direct 
simulations of multiscale physics at the atomic scale are now possible. In the case of more than 1 billion particles, 
the population dynamics of a bubble-size distribution can be observed with satisfactory statistics. This will allow 
the study of the nonequilibrium dynamics of multibubble interactions. We executed our codes on two different sites 
with different methods of parallelization, thus showing that our codes are portable and flexible. The flat-MPI method 
exhibited better performance than the hybrid, while the computation assigned to each core was identical. Another 
issue is to identify the main factor contributing to the parallel overhead of OpenMP. The amount of memory per 
computational power on a node will decrease as the number of cores per node increases. This trend will make it 
difficult to perform simulations with flat-MPI in the future. The pseudo-flat MPI approach allows us to perform a 
job with flat-MPI to obtain better performance or with the hybrid to reduce memory consumption. The developed 
code is available online [2611. 
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